Example: Patient Satisfaction

A hospital administrator wished to study the relation between patient satisfaction (a measured index score between 0 and 100) and the patient's age (measured in years), severity of illness (an index), anxiety level (an index), and sex (coded as 0 = male, 1 = female). Larger values of the three index variables indicate more satisfaction, greater severity of illness, and more anxiety, respectively. The administrator randomly selected 46 patients and collected the data, which appear in the text file patientsatisfaction.txt.

Read in the data and take a quick look:

patsatdat <- read.table("patientsatisfaction.txt", header=TRUE)
kable(head(patsatdat))
satis.index   age   ill.sev.index   anx.index   sex
         48    50              51         2.3     0
         57    36              46         2.3     0
         66    40              48         2.2     0
         70    41              44         1.8     0
         89    28              43         1.8     1
         36    49              54         2.9     0
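Before modeling, it can also help to confirm how R read each variable and to check basic ranges (a quick sketch using the patsatdat data frame from above):

```r
# Confirm each column's type (all should be numeric here)
str(patsatdat)

# Five-number summaries plus means, one column per variable
summary(patsatdat)
```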

What is different about this compared to the problems we were seeing in Chapters 2-4?

Statisticians use regression to build models to investigate the nature of such relationships amongst variables. When multiple predictors are considered simultaneously, the technique is known as multiple regression modeling.

The goals of an analysis of these data are to:

  1. Develop a model that explains the relationship between these predictors and patient satisfaction as best we can. (Note what we just said there … we will develop a model through a process of assessing the “goodness” of a model. The model is not determined based on some designed experiment!)
  2. Use the model to make inferences about the population that the selected sample of patients represents.

The first item in the list above underscores the importance of knowing the form of the regression model being fitted, because we may try several different models for a given data set. The basic form of a main effects multiple linear regression (MLR) model with \(k\) predictors is:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_k X_k + \varepsilon\]

In the present context, this model is given by

\[\textrm{Patient Satis Score} = \beta_0 + \beta_1 (\textrm{Age}) + \beta_2 (\textrm{Illness Severity Index}) + \beta_3 (\textrm{Anxiety Index}) + \beta_4 (\textrm{Sex}) + \varepsilon\]

EDA: Correlation analysis

While it is still important to observe the distribution of the individual variables in the study (like we have done in prior EDAs), the main focus of a regression analysis is to study associations between variables. This can be effectively done via a scatterplot matrix (available in GGally with the ggscatmat() function), which displays the pairwise associations that exist in the sample data, along with Pearson correlation coefficients:

ggscatmat(patsatdat)

Correlation between two variables measures the amount of linear relationship between them. A correlation is typically denoted with an \(r\) (e.g., \(r=-0.79\) for the satis.index and age variables). As a reminder, any correlation is bounded between \(-1\) and \(1\): the sign gives the direction of the bivariate linear relationship, and the distance from 0 gives its strength (thus age and satis.index have a moderately strong negative relationship).
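The pairwise correlations displayed in the scatterplot matrix can also be obtained numerically with cor() (a sketch, again assuming the patsatdat data frame from above):

```r
# Pearson correlation matrix for all pairs of variables, rounded for readability
round(cor(patsatdat), 2)
```

The row/column for satis.index against age should match the \(r=-0.79\) value noted above.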

What are some things to notice in the above EDA?

  • Do any of the variables appear to be heavily skewed, or are there any unusual observations (outliers) to be concerned about?
  • Satisfaction appears to be negatively correlated with age, illness and anxiety levels. Does this make intuitive sense?
  • Does the correlation between satisfaction index and sex make any sense?
  • Does it matter that it appears that the predictors look to be related to each other?

Over the coming weeks we will answer all of these (and more!) questions by covering various statistical methods.

Now, how do we determine how to fit a model of the form above to these data? For that, we consider a simple dataset with just one predictor variable…


Simple Linear Regression Example

Manatees are large, gentle sea creatures that live along the Florida coast. The number of manatees killed by motorboats in the state of Florida has been continually monitored since 1974 by the Florida Fish and Wildlife Conservation Commission. The number of motorboats registered within the state since 1977 is available from the state’s Department of Motor Vehicles. The original data are attributed to Elementary Statistics, 9th edition by Mario Triola but have since been updated. The data are in the file manatee.csv on the Canvas site.

Below is a scatterplot of the raw data. Note our creativity in choice of a plotting symbol:

Visually it looks as if there is an upward-trending line (the more boats registered, the more manatees killed). Suppose we decide to fit the following model to the data:

\[Y = \beta_0 + \beta_1 X + \varepsilon\]

where \(Y\) is the number of manatees killed and \(X\) is the number of boats registered in a given year.

How do we estimate the coefficients \(\beta_0\) and \(\beta_1\)? Let’s consider different line “fits” to the data in the following interactive display. You may also see details of each data point by hovering your cursor over each point:

What choice appears to be the best? It’s hard to say. For a bit more focus, let’s consider just two specific competing fitted lines:

The plot above includes a depiction of the model “error” for each point under each of the two fitted lines, calculated as the observed number of manatees killed compared to the fitted line (which produces a model-based predicted number of manatees killed). Ideally, we want a line with “smaller” errors overall. We find the best such model using the method of least squares, where we minimize the quantity

\[\begin{eqnarray*} RSS &=& \sum_{i=1}^n \left(\textrm{Error at point } i\right)^2 \\ &=& \sum_{i=1}^n \left(e_i\right)^2 \\ &=& \sum_{i=1}^n \left(Y_i - \hat{Y}_i\right)^2 \\ &=& \sum_{i=1}^n \left(Y_i - b_0 - b_1 X_i\right)^2 \end{eqnarray*}\]
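Setting the partial derivatives of RSS with respect to \(b_0\) and \(b_1\) equal to zero and solving yields the familiar closed-form least squares estimates:

\[b_1 = \frac{\sum_{i=1}^n \left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{\sum_{i=1}^n \left(X_i - \bar{X}\right)^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X}\]

These are exactly the values that lm() computes for us below.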

For the two models fitted in the display above, the RSS for each is

Group          RSS
Model 1   4398.861
Model 2   3222.810

So Model 2 is a better fit than Model 1 (you can see that visually as well) … but is it the best?
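You can compute the RSS for any candidate line yourself with a small helper function (a sketch; the exact intercepts and slopes of Models 1 and 2 are not shown above, so a tiny synthetic data set is used to check the function):

```r
# Residual sum of squares for the candidate line y = b0 + b1*x
rss <- function(b0, b1, x, y) sum((y - b0 - b1 * x)^2)

# Tiny synthetic check: the line y = 1 + 2x fits these points exactly
x <- c(1, 2, 3)
y <- c(3, 5, 7)
rss(1, 2, x, y)   # 0: a perfect fit has zero RSS
rss(0, 2, x, y)   # 3: each residual is 1, and 1^2 + 1^2 + 1^2 = 3
```

Plugging in each candidate pair of coefficients with the manatee data, e.g. rss(b0, b1, manatee$BoatsRegistered, manatee$ManateesKilled), reproduces the RSS values in the table above.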

To determine the best fitting model overall, we need to use calculus with the above RSS function and find the optimal values of \(b_0\) and \(b_1\). In R, we do this with the lm() function. Consider the following:

lm(ManateesKilled ~ BoatsRegistered, data=manatee)
## 
## Call:
## lm(formula = ManateesKilled ~ BoatsRegistered, data = manatee)
## 
## Coefficients:
##     (Intercept)  BoatsRegistered  
##        -43.5956           0.1326

Thus, for each additional boat registered, the expected number of manatees killed by motorboats increases by 0.1326. Or to phrase it another way, for every 100 additional boats registered, we can expect approximately 13 more manatees on average to be killed by motorboats.

Note the intercept here is conceptually meaningless: it corresponds to the expected number of manatees killed by motorboats when zero motorboats are registered in the state of Florida. (This is a common phenomenon of regression – the intercept often tends to have no contextual interpretation.)
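Once the fitted model is stored in an object, predictions at new values of the predictor come from predict() (a sketch; the registration values below are hypothetical, chosen only to illustrate the call):

```r
# Store the fitted simple linear regression model
manatee.fit <- lm(ManateesKilled ~ BoatsRegistered, data = manatee)

# Model-based predicted manatee deaths at two hypothetical registration counts
predict(manatee.fit, newdata = data.frame(BoatsRegistered = c(500, 700)))
```

Each prediction is simply \(-43.5956 + 0.1326 X\) evaluated at the supplied value of BoatsRegistered.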


Multiple regression example: Patient Satisfaction

We will fit the multiple regression model

\[\textrm{Patient Satis Score} = \beta_0 + \beta_1 (\textrm{Age}) + \beta_2 (\textrm{Illness Severity Index}) + \beta_3 (\textrm{Anxiety Index}) + \beta_4 (\textrm{Sex}) + \varepsilon\]

to the data using R and the function lm. For now, we will forgo the checking of assumptions (we’ll see this very soon!) because the assumptions underlying the model mainly pertain to using it for statistical inference purposes (Chapter 6).

mr.model1 <- lm(satis.index ~ age + ill.sev.index + anx.index + sex, data=patsatdat)
summary(mr.model1)
## 
## Call:
## lm(formula = satis.index ~ age + ill.sev.index + anx.index + 
##     sex, data = patsatdat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.347  -6.417   0.520   8.374  17.165 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   158.49433   18.36268   8.631 9.16e-11 ***
## age            -1.14173    0.21951  -5.201 5.86e-06 ***
## ill.sev.index  -0.44201    0.49793  -0.888   0.3799    
## anx.index     -13.46700    7.23154  -1.862   0.0697 .  
## sex            -0.01183    3.04192  -0.004   0.9969    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.18 on 41 degrees of freedom
## Multiple R-squared:  0.6822, Adjusted R-squared:  0.6512 
## F-statistic:    22 on 4 and 41 DF,  p-value: 9.332e-10
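Each slope estimate in the output is interpreted as the expected change in satisfaction per one-unit increase in that predictor, holding the other predictors fixed. Because the model is linear, this can be verified directly from predictions (a sketch, assuming the mr.model1 fit from above; the two hypothetical patients differ only in age):

```r
# Two hypothetical patients identical except for a one-year age difference
p1 <- data.frame(age = 40, ill.sev.index = 50, anx.index = 2.2, sex = 0)
p2 <- data.frame(age = 41, ill.sev.index = 50, anx.index = 2.2, sex = 0)

# The difference in predicted satisfaction equals the age coefficient (about -1.14)
predict(mr.model1, p2) - predict(mr.model1, p1)
```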

Several interpretive things to take note of here:

In the text, we discuss the practical challenges of interpreting the \(\beta\)-parameter estimates when a model is fit to observational data. Nonetheless, the estimates can be interpreted in the present context as described above; we will discuss the practical challenges in class.

Now, what are some of the issues with practical interpretation of these parameter estimates?